Exploratory Data Analysis


Overview

To begin our project, this notebook performs an exploratory analysis of the IBM HR Analytics Employee Attrition dataset. We investigate the factors that lead to attrition, which represents employees leaving the company (either voluntarily or involuntarily).

  • The overall goal is not only to build a predictive model for the target Attrition, but to discover specific changes the business could make to reduce it.
  • Attrition poses a significant cost to organizations through lost productivity, rehiring expenses, and weakened team morale. If there are ways to help prevent

Notebook Outline

  1. Load and validate data
  2. Initial data summary
  3. Univariate analysis
  4. Effect on attrition
  5. Feature Correlation

1. Load and validate data


First, we load the IBM HR Analytics Employee Attrition & Performance dataset from the data/raw/ directory. We verify the dataset was read correctly and perform a basic inspection.

  • Shape: 1470 rows × 35 columns.
  • Target: Attrition
    • Yes = left the company, No = still employed at the time of data collection.
  • Data types:
    • int64: 26 columns (numerical features such as Age, MonthlyIncome, DistanceFromHome, …).
    • object: 9 columns (categorical features such as Gender, JobRole, BusinessTravel, …).
  • No null values:
    • According to .info(), there are no missing values in any column.
  • Columns requiring attention before modeling:
    • Categorical variables (object dtype):
      • These need to be one-hot encoded before logistic regression, as the model requires all inputs to be numeric.
    • Constant or non-informative columns (to be dropped):
      • EmployeeNumber: Identifier. Only one table so not necessary for any merging.
      • EmployeeCount: Always 1.
      • Over18: Always “Y”.
      • StandardHours: Always 80.
      • These will be removed to avoid introducing noise or unnecessary dimensionality.
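The inspection described above could be sketched roughly as follows. The file name under data/raw/ is not given in this notebook, so the sketch reads a tiny inline CSV instead (three rows and four columns copied from the preview); `inspect_frame` is a hypothetical helper, not the notebook's own code.

```python
import io

import pandas as pd


def inspect_frame(df: pd.DataFrame) -> dict:
    """Collect the basic facts checked after loading: shape, dtypes, nulls."""
    return {
        "shape": df.shape,
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
        "total_nulls": int(df.isna().sum().sum()),
    }


# Stand-in for pd.read_csv(<file under data/raw/>); the real path is not
# stated in the notebook, so an in-memory CSV is used here instead.
sample_csv = io.StringIO(
    "Age,Attrition,BusinessTravel,MonthlyIncome\n"
    "41,Yes,Travel_Rarely,5993\n"
    "49,No,Travel_Frequently,5130\n"
    "37,Yes,Travel_Rarely,2090\n"
)
df = pd.read_csv(sample_csv)
report = inspect_frame(df)
print(report)
```

On the full dataset the same helper would be expected to report the shape and null count listed above.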

Shape of dataset: (1470, 35)
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 2 Female 94 3 2 Sales Executive 4 Single 5993 19479 8 Y Yes 11 3 1 80 0 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 2 3 Male 61 2 2 Research Scientist 2 Married 5130 24907 1 Y No 23 4 4 80 1 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 4 4 Male 92 2 1 Laboratory Technician 3 Single 2090 2396 6 Y Yes 15 3 2 80 0 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 5 4 Female 56 3 1 Research Scientist 3 Married 2909 23159 1 Y Yes 11 3 3 80 0 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 7 1 Male 40 3 1 Laboratory Technician 2 Married 3468 16632 9 Y No 12 3 4 80 1 6 3 3 2 2 2 2
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                  1470 non-null   int64 
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   int64 
 17  MaritalStatus             1470 non-null   object
 18  MonthlyIncome             1470 non-null   int64 
 19  MonthlyRate               1470 non-null   int64 
 20  NumCompaniesWorked        1470 non-null   int64 
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64 
 24  PerformanceRating         1470 non-null   int64 
 25  RelationshipSatisfaction  1470 non-null   int64 
 26  StandardHours             1470 non-null   int64 
 27  StockOptionLevel          1470 non-null   int64 
 28  TotalWorkingYears         1470 non-null   int64 
 29  TrainingTimesLastYear     1470 non-null   int64 
 30  WorkLifeBalance           1470 non-null   int64 
 31  YearsAtCompany            1470 non-null   int64 
 32  YearsInCurrentRole        1470 non-null   int64 
 33  YearsSinceLastPromotion   1470 non-null   int64 
 34  YearsWithCurrManager      1470 non-null   int64 
dtypes: int64(26), object(9)
memory usage: 402.1+ KB
None

Column validation

  • All expected columns are confirmed to be present and are correctly named (no spaces, misspellings, etc.).
Column check passed: All expected columns are present.
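One way such a column check might look is sketched below. `check_columns` is a hypothetical helper and the `expected` list is only an illustrative subset of the 35 column names; the notebook's actual check is not shown.

```python
import pandas as pd


def check_columns(df: pd.DataFrame, expected: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    missing = sorted(set(expected) - set(df.columns))
    if missing:
        problems.append(f"missing columns: {missing}")
    # Flag names with leading/trailing whitespace or embedded spaces.
    spaced = [c for c in df.columns if c != c.strip() or " " in c]
    if spaced:
        problems.append(f"column names containing spaces: {spaced}")
    return problems


expected = ["Age", "Attrition", "OverTime"]  # illustrative subset only
good = pd.DataFrame(columns=["Age", "Attrition", "OverTime"])
bad = pd.DataFrame(columns=["Age", "Attrition ", "Over Time"])

print(check_columns(good, expected))  # no problems reported
print(check_columns(bad, expected))   # missing names and names with spaces
```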

Drop non-informative columns

  • Columns that do not provide meaningful information are removed:
    • EmployeeNumber, EmployeeCount, Over18, StandardHours
  • Removing these columns at this early stage simplifies the dataset and prevents them from accidentally influencing the data analysis or model.
Dropped columns: ['EmployeeCount', 'Over18', 'StandardHours', 'EmployeeNumber']
New shape: (1470, 31)
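A minimal sketch of how the constant columns could be detected rather than listed by hand, assuming a column is non-informative when it holds a single distinct value. `constant_columns` is a hypothetical helper, and the three-row frame is a stand-in built from values in the preview above.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list[str]:
    """Columns holding a single distinct value carry no information."""
    return [col for col in df.columns if df[col].nunique(dropna=False) == 1]


df = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "EmployeeNumber": [1, 2, 4],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
})

# The identifier varies per row, so it has to be added to the list explicitly.
to_drop = constant_columns(df) + ["EmployeeNumber"]
reduced = df.drop(columns=to_drop)

print(f"Dropped columns: {to_drop}")
print(f"New shape: {reduced.shape}")
```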

Export dataset with dropped columns

  • Since no further changes will be made in this exploratory notebook, we export the dataset that reflects the dropped columns for use in the next notebook (as data_01.csv).
Data successfully exported to '../data/processed/data_01.csv'

2. Initial data summary


  • Numeric features like MonthlyIncome and MonthlyRate have wide ranges and will require scaling.
  • Categorical features have low to moderate cardinality (max = 9), making them suitable for one-hot encoding.
  • Ordinal features (Education, JobLevel, satisfaction scores) are already numerically encoded and can be used as-is.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 31 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EnvironmentSatisfaction   1470 non-null   int64 
 9   Gender                    1470 non-null   object
 10  HourlyRate                1470 non-null   int64 
 11  JobInvolvement            1470 non-null   int64 
 12  JobLevel                  1470 non-null   int64 
 13  JobRole                   1470 non-null   object
 14  JobSatisfaction           1470 non-null   int64 
 15  MaritalStatus             1470 non-null   object
 16  MonthlyIncome             1470 non-null   int64 
 17  MonthlyRate               1470 non-null   int64 
 18  NumCompaniesWorked        1470 non-null   int64 
 19  OverTime                  1470 non-null   object
 20  PercentSalaryHike         1470 non-null   int64 
 21  PerformanceRating         1470 non-null   int64 
 22  RelationshipSatisfaction  1470 non-null   int64 
 23  StockOptionLevel          1470 non-null   int64 
 24  TotalWorkingYears         1470 non-null   int64 
 25  TrainingTimesLastYear     1470 non-null   int64 
 26  WorkLifeBalance           1470 non-null   int64 
 27  YearsAtCompany            1470 non-null   int64 
 28  YearsInCurrentRole        1470 non-null   int64 
 29  YearsSinceLastPromotion   1470 non-null   int64 
 30  YearsWithCurrManager      1470 non-null   int64 
dtypes: int64(23), object(8)
memory usage: 356.1+ KB
None
Age DailyRate DistanceFromHome Education EnvironmentSatisfaction HourlyRate JobInvolvement JobLevel JobSatisfaction MonthlyIncome MonthlyRate NumCompaniesWorked PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
count 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000
mean 36.923810 802.485714 9.192517 2.912925 2.721769 65.891156 2.729932 2.063946 2.728571 6502.931293 14313.103401 2.693197 15.209524 3.153741 2.712245 0.793878 11.279592 2.799320 2.761224 7.008163 4.229252 2.187755 4.123129
std 9.135373 403.509100 8.106864 1.024165 1.093082 20.329428 0.711561 1.106940 1.102846 4707.956783 7117.786044 2.498009 3.659938 0.360824 1.081209 0.852077 7.780782 1.289271 0.706476 6.126525 3.623137 3.222430 3.568136
min 18.000000 102.000000 1.000000 1.000000 1.000000 30.000000 1.000000 1.000000 1.000000 1009.000000 2094.000000 0.000000 11.000000 3.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000
25% 30.000000 465.000000 2.000000 2.000000 2.000000 48.000000 2.000000 1.000000 2.000000 2911.000000 8047.000000 1.000000 12.000000 3.000000 2.000000 0.000000 6.000000 2.000000 2.000000 3.000000 2.000000 0.000000 2.000000
50% 36.000000 802.000000 7.000000 3.000000 3.000000 66.000000 3.000000 2.000000 3.000000 4919.000000 14235.500000 2.000000 14.000000 3.000000 3.000000 1.000000 10.000000 3.000000 3.000000 5.000000 3.000000 1.000000 3.000000
75% 43.000000 1157.000000 14.000000 4.000000 4.000000 83.750000 3.000000 3.000000 4.000000 8379.000000 20461.500000 4.000000 18.000000 3.000000 4.000000 1.000000 15.000000 3.000000 3.000000 9.000000 7.000000 3.000000 7.000000
max 60.000000 1499.000000 29.000000 5.000000 4.000000 100.000000 4.000000 5.000000 4.000000 19999.000000 26999.000000 9.000000 25.000000 4.000000 4.000000 3.000000 40.000000 6.000000 4.000000 40.000000 18.000000 15.000000 17.000000
Unique values per column:
Attrition                      2
Gender                         2
PerformanceRating              2
OverTime                       2
MaritalStatus                  3
BusinessTravel                 3
Department                     3
JobInvolvement                 4
StockOptionLevel               4
RelationshipSatisfaction       4
EnvironmentSatisfaction        4
WorkLifeBalance                4
JobSatisfaction                4
Education                      5
JobLevel                       5
EducationField                 6
TrainingTimesLastYear          7
JobRole                        9
NumCompaniesWorked            10
PercentSalaryHike             15
YearsSinceLastPromotion       16
YearsWithCurrManager          18
YearsInCurrentRole            19
DistanceFromHome              29
YearsAtCompany                37
TotalWorkingYears             40
Age                           43
HourlyRate                    71
DailyRate                    886
MonthlyIncome               1349
MonthlyRate                 1427
dtype: int64
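The cardinality check and the one-hot encoding it motivates could be sketched as below, on a small stand-in frame. Using `drop_first=True` is an assumption made here to avoid perfectly collinear dummy columns in a logistic regression; the notebook does not say which encoding options it uses.

```python
import pandas as pd

# Small stand-in frame; the real dataset has eight object columns after the drop.
df = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Gender": ["Female", "Male", "Male", "Female"],
    "BusinessTravel": [
        "Travel_Rarely", "Travel_Frequently", "Travel_Rarely", "Non-Travel",
    ],
})
cat_cols = ["Gender", "BusinessTravel"]

# Number of distinct levels per categorical column (low values suit one-hot encoding).
print(df[cat_cols].nunique().sort_values())

# One indicator column per level, minus the first level of each feature.
encoded = pd.get_dummies(df, columns=cat_cols, drop_first=True, dtype=int)
print(encoded.columns.tolist())
```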

Target variable distribution: Attrition

  • 83.88% of employees stayed (Attrition = No)
  • 16.12% of employees left (Attrition = Yes)
  • There is a significant class imbalance: the majority class (non-attrition) dominates the dataset. A model trained on it can reach high accuracy by mostly predicting No, missing many actual Attrition = Yes cases (false negatives).
  • To mitigate this, we’ll use the class_weight='balanced' parameter in models like logistic regression, which adjusts the loss function to penalize misclassifying the minority class more heavily.
  • Also, we will use SMOTE (Synthetic Minority Over-sampling Technique), which oversamples the minority class by generating synthetic examples interpolated between existing minority-class neighbors rather than duplicating rows.
           Count  Percentage
Attrition
No          1233       83.88
Yes          237       16.12
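As a rough illustration, the table above and the per-class weights implied by 'balanced' weighting can be reproduced from the reported counts alone. The weight formula used here, n_samples / (n_classes × class_count), is the one scikit-learn documents for class_weight='balanced'; the variable names are illustrative.

```python
import pandas as pd

# Rebuild the target column from the counts reported above.
attrition = pd.Series(["No"] * 1233 + ["Yes"] * 237, name="Attrition")

counts = attrition.value_counts()
summary = pd.DataFrame({
    "Count": counts,
    "Percentage": (counts / len(attrition) * 100).round(2),
})
print(summary)

# 'balanced' weighting: n_samples / (n_classes * class_count)
weights = len(attrition) / (counts.size * counts)
print(weights.round(3))
```

The minority class ends up weighted roughly five times as heavily as the majority class, which is the ratio of the two class counts.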

3. Univariate analysis


We analyze the distribution of each feature independently.

  • Numeric features: visualized using histograms and boxplots.
  • Categorical features: visualized using countplots to show category frequency.

Numeric features

  • Right-skewed distributions are observed in MonthlyIncome, TotalWorkingYears, YearsAtCompany, and DistanceFromHome. These may benefit from log transformation to reduce the influence of extreme values.
  • Distributions for ordinal features like Education, JobLevel, JobInvolvement, and the various satisfaction scores are clustered around a few discrete integer values.
    • These represent categorical levels encoded as integers and can be left unscaled.
  • Variables such as YearsSinceLastPromotion, YearsWithCurrManager, and NumCompaniesWorked show strong peaks at zero, capturing employees with little prior experience or recent role changes.
    • These may have nonlinear effects on attrition:
      • For example, the risk of attrition might stay flat for several years, then spike suddenly after a long period without promotion or job change.
  • Salary-related variables (HourlyRate, DailyRate, MonthlyRate, MonthlyIncome) have varying scales, which can be more easily compared by standardizing their values.
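The effect of a log transformation on a right-skewed feature can be sketched as follows. The values are made up for illustration and only mimic a long right tail; `np.log1p` is used as an assumption so that zero-valued features (such as the tenure columns) remain defined.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values with one large outlier (illustrative only).
income = pd.Series([1000, 2000, 2500, 3000, 5000, 19999], name="MonthlyIncome")

log_income = np.log1p(income)  # log(1 + x), defined at x = 0

print(f"skew before: {income.skew():.2f}")
print(f"skew after:  {log_income.skew():.2f}")

# The transform is invertible, so values can be mapped back when reporting.
restored = np.expm1(log_income)
```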

Demographics

  • Age shows a slightly right-skewed distribution, with most employees between 30 and 40 years old.
  • DistanceFromHome is heavily right-skewed, indicating that most employees live within 10 km (~ 6.2 miles) of the workplace.
  • Education is an ordinal feature peaking at level 3, with level 5 being the least common.

These features may relate to attrition through commute stress, career stage, or not being properly qualified for the position.

Compensation

Note: Although StockOptionLevel is an ordinal categorical variable representing discrete levels (0–3), we treat it as numerical here purely for the purpose of visualizing its distribution. For modeling, it should be treated as a categorical feature to avoid implying linear relationships between the levels.

  • HourlyRate, DailyRate, and MonthlyRate appear roughly uniformly distributed, with no visible structure, which may indicate limited predictive value for modeling.
  • MonthlyIncome is right-skewed with a long tail and several high outliers, indicating a wide income disparity among employees.
  • PercentSalaryHike is moderately skewed right, with most employees receiving raises between 11% and 15%.
  • StockOptionLevel is heavily concentrated at 0 and 1, with relatively few employees receiving higher stock options.
  • JobLevel is concentrated at levels 1 and 2, implying that most employees are at the lower rungs of the organizational hierarchy.
  • PerformanceRating is almost entirely at level 3, perhaps due to a lack of variation in evaluations.

While most compensation variables are evenly spread, actual monthly income, percent salary hikes, and stock option levels show more variation — which may reflect underlying compensation policies for organizational rank (JobLevel) and/or performance-based incentives (PerformanceRating).

Satisfaction and engagement

  • EnvironmentSatisfaction, JobSatisfaction, and RelationshipSatisfaction all have their largest counts at levels 3 and 4, suggesting most employees report moderate to high satisfaction. However, a significant number of employees report the lower two levels for these features, which are possible areas for improvement.
    • RelationshipSatisfaction most likely refers to personal relationships (spouse or partner), not interpersonal relationships between employees, although this isn’t specified for the dataset.
  • JobInvolvement and WorkLifeBalance are heavily concentrated at level 3, indicating a generally engaged workforce with a healthy work-life balance, although a smaller but still significant number of employees report levels 1 and 2.

According to the data, most employees feel moderately satisfied and involved, but there is room for improvement among the sizable minority who report lower levels of these metrics.

Tenure and career

  • TotalWorkingYears, YearsAtCompany, and YearsInCurrentRole display long right tails, indicating a small group of highly tenured individuals.
    • There is a curious spike at ~ 7.5 years for YearsInCurrentRole - perhaps this represents a group that is ripe for a promotion.
  • TrainingTimesLastYear shows distinct spikes, most commonly at 2–3 training sessions.
  • YearsSinceLastPromotion shows mostly recent promotions, though some employees have not been promoted for over a decade.
  • YearsWithCurrManager shows clustering at low values (around ~ 0 and ~ 2.0), suggesting frequent managerial changes.
    • This distribution is very similar to YearsInCurrentRole (showing a similar spike around 7.5 years), pointing to a subset of employees experiencing career stagnation.
  • NumCompaniesWorked also has a right-skewed distribution, with many employees having worked at one or two companies, and fewer with broader external experience.

These patterns point to a predominantly early-career workforce with frequent recent promotions and high managerial turnover, though a minority of employees remain in the same roles or under the same managers for extended periods.

Categorical features

Role and department

  • Department is dominated by employees in Research & Development, followed by Sales, with very few in Human Resources.
  • JobRole shows that most employees have the titles Sales Executive, Research Scientist, and Laboratory Technician, while there is a lower representation for director or manager-level roles (as one would expect).
    • The disproportionately high number of Sales Executives relative to Sales Representatives may reflect either inflated titling practices or a focus on high-value client relationships over mass lead generation from an abundance of lower-rung employees (cold calling, mass emails, etc.).
  • Surprisingly, EducationField is concentrated in Life Sciences and Medical, with other fields such as Marketing and Technical Degree trailing behind.
    • While IBM is not typically associated with large medical or life sciences teams, this dataset is synthetic and intended for modeling purposes, so the high representation of these education fields likely reflects simulated variety rather than the company’s actual workforce.

Overall, the workforce is concentrated in research and sales functions, with a high representation of life sciences and medical educational backgrounds — suggesting the dataset simulates a company involved in scientific or healthcare-related analytics, despite being labeled as IBM.

Demographics

  • Gender shows a roughly 60/40 split between male and female.
  • MaritalStatus shows that a majority of employees are married, followed by single and divorced individuals.
    • The higher proportion of married employees may correlate with longer tenure (perhaps because of having children).

Work pattern

  • BusinessTravel: Most employees either travel rarely for business or not at all. Very few travel frequently.
  • OverTime: The majority of employees do not work overtime, though a substantial minority does.

The minority of employees who frequently travel or work overtime may suffer from burnout that leads to attrition.

4. Effect on attrition


  • Attrition is most common among (relatively) young employees who earn less, hold lower-level roles, and receive fewer opportunities to grow within the organization.
  • Long commutes and overtime work also appear to contribute significantly to employee attrition.
  • This suggests that concentrating efforts or investment on supporting employees who may be new to the workforce, and/or those who are required to travel and work long hours, may have a significant impact.

Numeric features vs Attrition

Demographics

  • Age: Employees who left the company skew younger, with a noticeable peak in the late 20s – early 30s range. Those who stayed are more evenly distributed across older age groups, suggesting that younger employees may be more prone to leave.
  • DistanceFromHome: There’s a wider spread for employees who left, indicating that longer commutes might correlate with higher attrition risk.
  • Education: Distributions are similar across both groups, implying that education level likely has minimal impact on attrition.

Overall, Age and DistanceFromHome may be useful predictors, while Education appears less relevant.

Compensation

Overall, the plots suggest that compensation structure — especially total monthly income and long-term incentives like stock options — may play a meaningful role in employee attrition risk.

Note: Although StockOptionLevel is an ordinal categorical variable representing discrete levels (0–3), we treat it as numerical here purely for the purpose of visualizing its distribution. For modeling, it should be treated as a categorical feature to avoid implying linear relationships between the levels.

  • MonthlyIncome, DailyRate: Employees who stayed tend to have higher and more widely distributed incomes. Those who left cluster more tightly around lower income levels.
    • This may indicate that employees with lower salaries are more likely to leave, which is expected.
  • PercentSalaryHike: There is a subtle difference where retained employees received slightly more frequent or higher salary hikes.
    • Although the difference is modest, a small cumulative effect over time might influence retention.
  • StockOptionLevel: Employees who stayed had slightly more presence at higher stock option levels.
    • This may reflect better long-term incentives provided to retained employees, suggesting stock options could act as a retention booster.
  • Other compensation variables such as HourlyRate and MonthlyRate do not show strong separation, suggesting they may be less influential or redundant with MonthlyIncome.

Tenure and career

These patterns suggest that attrition is more common among employees with shorter tenure, fewer internal promotions, and more prior employers.

  • TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, and YearsWithCurrManager are all lower on average for those who left, indicating that shorter tenures are associated with higher attrition risk. This may reflect a lack of long-term engagement or low satisfaction early in their employment.
  • YearsSinceLastPromotion shows minimal difference between attrition groups, indicating that promotion timing alone may not be a significant driver of employee turnover.
  • NumCompaniesWorked: While employees who left include more instances with many prior employers (indicated by the fatter tail towards higher values), their median NumCompaniesWorked is lower than that of those who stayed, suggesting that attrition may also be common among employees with limited prior experience.
  • TrainingTimesLastYear: Employees who left tend to receive slightly less training than those who stayed, with fewer individuals receiving 3 or more sessions. This may reflect a subtle link between lower development investment and attrition risk.
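Group-level comparisons like those above can be backed by a simple per-group summary. The sketch below uses only the five rows visible in the dataset preview, so its numbers illustrate the mechanics and say nothing about the full data.

```python
import pandas as pd

# First five rows of the dataset, restricted to three columns.
df = pd.DataFrame({
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
    "YearsAtCompany": [6, 10, 0, 8, 2],
    "MonthlyIncome": [5993, 5130, 2090, 2909, 3468],
})

# Median of each numeric feature within each attrition group.
group_medians = df.groupby("Attrition")[["YearsAtCompany", "MonthlyIncome"]].median()
print(group_medians)
```

Medians are used here because several of these features are right-skewed, which makes means sensitive to a few long-tenured or highly paid employees.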

Satisfaction and engagement

While not all variables show strong separation, JobSatisfaction, EnvironmentSatisfaction, WorkLifeBalance, and JobLevel stand out as having visually apparent associations with attrition.

Note: While most features shown are ordinal categorical (JobSatisfaction, WorkLifeBalance, etc.), they are treated here as quasi-continuous solely to aid visual exploration of distributions.

  • EnvironmentSatisfaction: Employees who stayed tend to report higher environmental satisfaction compared to those who left.
  • JobInvolvement: Difference is minimal.
  • JobLevel: Attrition appears more common among employees at lower job levels (especially level 1), while those in higher positions tend to stay.
  • JobSatisfaction: A higher proportion of employees with low satisfaction left the company, indicating a clear link between job satisfaction and attrition.
  • PerformanceRating: This feature appears largely uniform across attrition groups.
  • RelationshipSatisfaction: Employees with lower relationship satisfaction scores are slightly more represented among those who left.
  • WorkLifeBalance: Attrition is more concentrated among employees who rated their work-life balance poorly (level 1 or 2).

Categorical features vs Attrition

Role and department breakdown

  • Department
    • Attrition is highest in Sales and Human Resources, suggesting these departments may involve higher stress or lower engagement, while Research & Development shows stronger retention, likely due to more specialized, stable roles.
  • Job role
    • Sales Representatives and Lab Technicians face the steepest attrition, highlighting a potential need for better support or career development in high-turnover roles, whereas leadership and research positions demonstrate strong retention.
  • Education field
    • Attrition is higher among employees with backgrounds in Human Resources, Marketing, and Technical Degrees, which may reflect dissatisfaction within those roles, while fields like Life Sciences and Medical show stronger retention, possibly due to better support from the organization and alignment of education and job expectations.

Demographics

  • Gender shows little predictive value, with similar attrition rates across males and females.
  • MaritalStatus, however, reveals that single employees are significantly more likely to leave, perhaps reflecting differences in financial stability or lifestyle priorities (such as having children).

Work pattern and attrition

  • BusinessTravel: Employees who travel frequently show significantly higher attrition rates. Those who travel rarely have lower rates, and non-travel employees show the lowest rate. There is a clear positive relationship between the amount of travel and attrition rate.
  • OverTime: There is a huge increase in attrition among employees who work overtime, reinforcing the idea that excessive workload contributes to dissatisfaction and departure.
  • JobLevel: Generally, attrition decreases as job level increases. Entry-level employees (JobLevel = 1) show the highest attrition, while mid to senior levels (3–5) show better retention. There is a rise in attrition at job level 3 which slightly disrupts this trend, warranting further investigation of the specific conditions of employment at this level.
  • StockOptionLevel: Employees with no stock options (0) have the highest attrition. Those with stock options at levels 1 to 3 show lower attrition, suggesting that equity incentives may help with retention. Notably, StockOptionLevel 3 has worse retention than levels 1 and 2.

Employees with heavy travel or overtime demands face much higher attrition - a possible reflection of poor work-life balance. Lower job levels and minimal stock options are also linked to higher attrition, suggesting that advancement and long-term incentives play a key role in retention.
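The attrition rates compared in this section could be computed per category level roughly as follows. `attrition_rate` is a hypothetical helper, and the frame holds only the five preview rows, so the printed rates are not the dataset's actual rates.

```python
import pandas as pd


def attrition_rate(df: pd.DataFrame, feature: str) -> pd.Series:
    """Share of employees with Attrition == 'Yes' within each level of `feature`."""
    left = df["Attrition"].eq("Yes")
    return left.groupby(df[feature]).mean().sort_values(ascending=False)


# First five rows of the dataset, restricted to two columns.
df = pd.DataFrame({
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
    "OverTime": ["Yes", "No", "Yes", "Yes", "No"],
})

rates = attrition_rate(df, "OverTime")
print((rates * 100).round(1))
```

The same call works for BusinessTravel, JobLevel, or StockOptionLevel by changing the `feature` argument.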

5. Feature Correlation


  • Salary, job level, and tenure features are tightly interlinked, signaling potential redundancy (multicollinearity) that could distort feature-importance estimates unless explicitly controlled for.
  • In contrast, satisfaction metrics and rate-based pay features stand apart—offering potentially unique signals that reflect individual experience rather than structural seniority.

Correlation heatmap

Compensation and tenure metrics tend to move together, while satisfaction, training, and rate features operate more independently.

  • JobLevel, MonthlyIncome, and TotalWorkingYears are tightly correlated, reflecting growth with seniority.
  • Tenure metrics like YearsAtCompany, YearsSinceLastPromotion, and YearsWithCurrManager also show strong internal alignment.
  • PerformanceRating and PercentSalaryHike are moderately linked, hinting at structured raise policies.
  • Most satisfaction and rate-based pay features (DailyRate, HourlyRate) show low correlation with other metrics.
  • Negative correlations are rare; one example is the pair NumCompaniesWorked and YearsWithCurrManager.

Correlation pairs

  • Strong correlations between JobLevel, MonthlyIncome, and TotalWorkingYears reflect a predictable hierarchy: tenure drives advancement, which drives pay.
  • Similarly, YearsAtCompany, YearsInCurrentRole, and YearsWithCurrManager are linked, capturing overlapping aspects of employee longevity.
  • The pairing of PercentSalaryHike and PerformanceRating suggests a structured, performance-tied raise system—potentially redundant in modeling.
Highly correlated numeric feature pairs (|corr| > 0.7):
     Feature 1           Feature 2              Correlation   AbsCorr
134  JobLevel            MonthlyIncome             0.950300  0.950300
141  JobLevel            TotalWorkingYears         0.782208  0.782208
198  PercentSalaryHike   PerformanceRating         0.773550  0.773550
168  MonthlyIncome       TotalWorkingYears         0.772893  0.772893
249  YearsAtCompany      YearsWithCurrManager      0.769212  0.769212
247  YearsAtCompany      YearsInCurrentRole        0.758754  0.758754
251  YearsInCurrentRole  YearsWithCurrManager      0.714365  0.714365
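A table of this kind can be produced along the following lines. This is a sketch, not the notebook's own code: `high_corr_pairs` is a hypothetical helper, Pearson correlation is assumed, and the demo values are made up so that exactly one pair exceeds the 0.7 threshold.

```python
from itertools import combinations

import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """List each numeric feature pair once when |Pearson r| exceeds `threshold`."""
    corr = df.corr(numeric_only=True)
    # combinations() yields each unordered pair once, skipping the diagonal.
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(rows, columns=["Feature 1", "Feature 2", "Correlation"])
    pairs["AbsCorr"] = pairs["Correlation"].abs()
    strong = pairs[pairs["AbsCorr"] > threshold]
    return strong.sort_values("AbsCorr", ascending=False)


# Made-up values: income rises with job level, daily rate does not.
df = pd.DataFrame({
    "JobLevel": [1, 2, 3, 4, 5],
    "MonthlyIncome": [2500, 5200, 9800, 15500, 19000],
    "DailyRate": [800, 300, 1200, 500, 900],
})
result = high_corr_pairs(df)
print(result)
```

Sorting by the absolute value keeps strong negative correlations near the top as well, which a sort on the signed value would push to the bottom.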

EDA summary


Key Insights from EDA:

  • Target imbalance:
    • Only ~16% of employees in the dataset have Attrition = Yes, indicating significant class imbalance. Future modeling should use metrics like ROC-AUC or recall instead of just accuracy.
  • Strong predictors identified:
    • Employees who work OverTime are nearly 3× more likely to leave.
    • Low JobSatisfaction, shorter tenure (YearsAtCompany), and low WorkLifeBalance are also associated with higher attrition.
    • Younger employees, those with low income, those with a longer commute (DistanceFromHome), and those in certain JobRoles (Sales, Laboratory Technician, …) appear to be more likely to leave.
  • Feature quality:
    • No missing values or duplicates detected.
    • All columns passed structure validation.
    • EmployeeCount, StandardHours, and Over18 show no variance and were dropped, along with the identifying column.
    • No negative or illogical values in numeric fields.
  • Correlation observations:
    • Strong correlations cluster around compensation and tenure.
    • Satisfaction, engagement, and location-related variables remain largely independent, offering distinct, potentially valuable signals for modeling attrition.

Next Steps:

  1. Encode categorical variables appropriately for modeling.
  2. Scale numeric features if using distance-based or linear models.
  3. Stratify the training/test split to preserve the class proportions.
  4. Prepare data for model interpretability.
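The stratified split in step 3 might be sketched with pandas alone, as below. The 20% test fraction and the random seed are assumptions for illustration; scikit-learn's `train_test_split` with its `stratify` argument is the more common route and would serve the same purpose.

```python
import pandas as pd

# Rebuild the target column from the counts reported in this notebook.
df = pd.DataFrame({"Attrition": ["No"] * 1233 + ["Yes"] * 237})

# Sample 20% within each class so the test set keeps the ~84/16 ratio.
test = df.groupby("Attrition", group_keys=False).sample(frac=0.2, random_state=42)
train = df.drop(test.index)

print(train["Attrition"].value_counts(normalize=True).round(3))
print(test["Attrition"].value_counts(normalize=True).round(3))
```

Any resampling such as SMOTE would then be applied to the training portion only, so that the test set continues to reflect the original class proportions.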

The dataset appears clean and predictive, with several features that are both statistically and intuitively linked to attrition.